A hypothesis class $H$ is PAC-learnable if there exists a learning algorithm with the following property: for every $\epsilon,\delta\in(0,1)$, every distribution $D$ over $X$, and every $h\in H$, when the algorithm is given $m(\epsilon,\delta)$ i.i.d. samples drawn from $D$ and labeled by $h$, it produces a hypothesis $\hat{h}$ such that, with probability at least $1-\delta$, $L(\hat{h})\le\epsilon$. (Note that the probability is over the randomness in the training set as well as any internal algorithmic randomness.)
Theorem 1 (PAC bound for finite hypothesis class)
Let $H$ be a hypothesis class of finite size $|H|$. Then $H$ is PAC-learnable with sample complexity
$$m(\epsilon,\delta) = O\!\left(\frac{1}{\epsilon}\log\frac{|H|}{\delta}\right).$$
The proof is shown in the previous note.
Solving for $m$: if $m \ge \frac{1}{\epsilon}\log\frac{|H|}{\delta}$, then the probability of failure in PAC learning is at most $\delta$, and hence we succeed with probability at least $1-\delta$.
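As a quick numeric check, here is a minimal sketch of this bound in Python (assuming natural log and the constant 1 from the standard union-bound proof; the hidden constant in the $O(\cdot)$ above may differ):

```python
import math

def pac_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Realizable-case bound: m >= (1/eps) * ln(|H| / delta).

    The constant comes from the usual union-bound argument;
    treat the exact value as illustrative, not tight.
    """
    return math.ceil((1.0 / eps) * math.log(h_size / delta))

# e.g. |H| = 1000 hypotheses, accuracy eps = 0.05, confidence delta = 0.01
print(pac_sample_complexity(1000, 0.05, 0.01))  # 231
```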
A hypothesis class $H$ is agnostic PAC learnable if there exist a function $m_H:(0,1)^2\to\mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon,\delta\in(0,1)$ and for every distribution $D$ over $X\times Y$, when running the learning algorithm on $m\ge m_H(\epsilon,\delta)$ i.i.d. examples generated by $D$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1-\delta$ (over the choice of the $m$ training examples),
$$L_D(h) \le \min_{h'\in H} L_D(h') + \epsilon,$$
where $L_D$ denotes the true risk under $D$.
Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative (w.r.t. domain $Z$, hypothesis class $H$, loss function $\ell$, and distribution $D$); that is, $|L_S(h)-L_D(h)|\le\frac{\epsilon}{2}$ for every $h\in H$. Then, any output of $\mathrm{ERM}_H(S)$, namely, any $h_S\in\operatorname{argmin}_{h\in H} L_S(h)$, satisfies
$$L_D(h_S) \le \min_{h\in H} L_D(h) + \epsilon.$$
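The proof is a short sandwich argument, applying representativeness twice and the definition of ERM once: for every $h\in H$,
$$L_D(h_S) \le L_S(h_S) + \frac{\epsilon}{2} \le L_S(h) + \frac{\epsilon}{2} \le L_D(h) + \frac{\epsilon}{2} + \frac{\epsilon}{2} = L_D(h) + \epsilon.$$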
We say that a hypothesis class $H$ has the uniform convergence property (w.r.t. a domain $Z$ and a loss function $\ell$) if there exists a function $m_H^{UC}:(0,1)^2\to\mathbb{N}$ such that for every $\epsilon,\delta\in(0,1)$ and for every probability distribution $D$ over $Z$, if $S$ is a sample of $m\ge m_H^{UC}(\epsilon,\delta)$ examples drawn i.i.d. according to $D$, then, with probability of at least $1-\delta$, $S$ is $\epsilon$-representative.
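To make the definition concrete, here is a small Monte-Carlo sketch that estimates how often a sample fails to be $\epsilon$-representative for a toy finite class (the threshold classifiers, the noise model, and all parameter values below are illustrative choices, not anything fixed by these notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: X = [0, 1), H = threshold classifiers h_t(x) = 1[x >= t]
# on a grid of 21 thresholds, labels from t = 0.3 with 10% label noise,
# and the 0-1 loss.
thresholds = np.linspace(0, 1, 21)
true_t, noise = 0.3, 0.1

def draw(m):
    x = rng.random(m)
    y = (x >= true_t).astype(int)
    flip = rng.random(m) < noise
    return x, np.where(flip, 1 - y, y)

def losses(x, y):
    # empirical 0-1 loss of every h in H on the sample (x, y)
    preds = x[None, :] >= thresholds[:, None]
    return (preds != y[None, :]).mean(axis=1)

# approximate the true losses L_D(h) with one very large sample
L_true = losses(*draw(200_000))

eps, m, trials = 0.1, 200, 1000
bad = sum(np.max(np.abs(losses(*draw(m)) - L_true)) > eps
          for _ in range(trials))
print(f"estimated P(S is not {eps}-representative): {bad / trials:.3f}")
```

Increasing $m$ drives this failure probability down, which is exactly what the function $m_H^{UC}$ quantifies.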
If a class $H$ has the uniform convergence property with a function $m_H^{UC}$, then the class is agnostically PAC learnable with the sample complexity $m_H(\epsilon,\delta)\le m_H^{UC}(\epsilon/2,\delta)$. Furthermore, in that case, the $\mathrm{ERM}_H$ paradigm is a successful agnostic PAC learner for $H$.
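Since $\mathrm{ERM}_H$ is the learner this result certifies, here is a minimal sketch of ERM over a finite hypothesis class (the hypothesis class and the toy sample below are placeholders for illustration):

```python
def erm(hypotheses, S):
    """Return a hypothesis minimizing the empirical 0-1 loss on S."""
    def emp_loss(h):
        return sum(h(x) != y for x, y in S) / len(S)
    return min(hypotheses, key=emp_loss)

# Illustrative finite class: threshold classifiers on [0, 1].
H = [lambda x, t=t: int(x >= t) for t in (0.1 * k for k in range(11))]
S = [(0.05, 0), (0.2, 0), (0.4, 1), (0.7, 1), (0.9, 1)]
h_S = erm(H, S)  # any empirical minimizer satisfies the lemma above
```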
In (3), we showed that $H$ has the uniform convergence property, and thus, by the uniform convergence result above, $H$ is agnostically PAC learnable with
$$m \ge \left\lceil \frac{2\log(2|H|/\delta)}{\epsilon^2} \right\rceil$$
samples.
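For contrast with the realizable case, a sketch of this agnostic bound (same caveats as before: natural log and illustrative constants):

```python
import math

def agnostic_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Agnostic bound: m >= ceil(2 * ln(2|H| / delta) / eps^2)."""
    return math.ceil(2.0 * math.log(2 * h_size / delta) / eps**2)

# Same |H|, eps, delta as before; the 1/eps^2 dependence makes m much larger.
print(agnostic_sample_complexity(1000, 0.05, 0.01))  # 9765
```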